Investigate TMDb Movie Dataset

Ekaterina Kuznetsova | Data Analyst Nanodegree | 20.10.2021

Table of Contents

Introduction

In the second project of my Data Analyst Nanodegree, I am investigating a file from The Movie Database (TMDb), which originated from Kaggle. This dataset contains important details about roughly 10,000 movies, including their budget, genre, popularity and more.

Information about all the available datasets can be found through the homepage links provided here.

Questions to answer:

  1. How much revenue did films bring in during the year they were released?
  2. Which movies have the highest and the lowest profit?
  3. Which 10 movies on TMDb are the most popular?
  4. How many movies were made each year?
  5. Which genres are the most popular?
  6. How is the release year related to the popularity of a genre?
  7. What is the average runtime of movies?

Data Wrangling

Data Cleaning

Before answering the questions, we need to prepare and clean our dataset.

First, let's drop columns. We will keep only the columns we need and remove the rest.

Columns to delete - id, imdb_id, budget_adj, revenue_adj, homepage, overview, production_companies, vote_count and vote_average.
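The column drop described above could look roughly like the sketch below. It runs on a tiny made-up table rather than the real file, and the column names that are not in the list above (`original_title`, `budget`, `revenue`) are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Tiny stand-in for the raw TMDb table; all values are made up for illustration.
raw = pd.DataFrame({
    "id": [1, 2],
    "imdb_id": ["tt0000001", "tt0000002"],
    "original_title": ["Movie A", "Movie B"],  # assumed column name
    "budget": [100, 200],
    "revenue": [300, 150],
    "budget_adj": [110.0, 210.0],
    "revenue_adj": [330.0, 160.0],
    "homepage": [None, None],
    "overview": ["text", "text"],
    "production_companies": ["X", "Y"],
    "vote_count": [10, 20],
    "vote_average": [6.5, 7.0],
})

# The nine columns listed for deletion.
columns_to_drop = [
    "id", "imdb_id", "budget_adj", "revenue_adj", "homepage",
    "overview", "production_companies", "vote_count", "vote_average",
]

# errors="ignore" skips any listed column that is not present in the frame.
movies = raw.drop(columns=columns_to_drop, errors="ignore")
print(movies.columns.tolist())
```

Passing the list through `columns=` keeps the intent readable, and `errors="ignore"` makes the cell safe to re-run after the columns are already gone.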

The difference between the raw dataset and the cleaned one is obvious, and the cleaned version is more convenient for analysis.

Now let's remove any duplicated rows.
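A minimal sketch of this step, assuming the data is held in a pandas DataFrame (the toy rows and column names below are hypothetical):

```python
import pandas as pd

# Toy data in which the third row repeats the first one exactly.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie A"],
    "budget": [100, 200, 100],
    "revenue": [300, 150, 300],
})

# Count rows that are exact copies of an earlier row.
duplicate_count = int(movies.duplicated().sum())
print("duplicated rows:", duplicate_count)

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
movies = movies.drop_duplicates().reset_index(drop=True)
```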

Which movies have a value of '0' in their budget, revenue and vote average? Let's delete these movies from the dataset, since a zero here stands for a missing value.
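One way to filter out such rows is a boolean mask, sketched here on made-up data. Only `budget` and `revenue` are checked in this example; a vote-average column could be added to `check_columns` if it is still present at this point.

```python
import pandas as pd

# Toy data: Movie B has no budget and Movie C has no revenue recorded.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100, 0, 250],
    "revenue": [300, 150, 0],
})

# Columns in which a zero is treated as "missing" (an assumption of this sketch).
check_columns = ["budget", "revenue"]

# True only for rows where every checked column is non-zero.
has_values = (movies[check_columns] != 0).all(axis=1)
cleaned = movies[has_values].reset_index(drop=True)
print(len(movies), "->", len(cleaned), "rows")
```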

At the beginning, our dataset had about 10,000 entries and 21 columns. After cleaning, only 3853 entries in 12 columns are left.

Exploratory Data Analysis

Q1: How much revenue did films bring in during the year they were released?

2015 was the year in which movies made the highest revenue.
Movies released that year earned more than 26 billion dollars. For comparison, movies released in 1966 earned just 847 million.
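Yearly totals like these can be obtained by grouping on the release year and summing the revenue. The sketch below assumes columns named `release_year` and `revenue` and uses made-up numbers, not the real figures.

```python
import pandas as pd

# Made-up revenues; only the shape of the computation matters here.
movies = pd.DataFrame({
    "release_year": [1966, 2015, 2015, 1999],
    "revenue": [10, 40, 60, 30],
})

# Total revenue of all movies released in each year.
revenue_by_year = movies.groupby("release_year")["revenue"].sum()

# Year with the largest total.
best_year = revenue_by_year.idxmax()
print(best_year, revenue_by_year.max())
```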

Now we want to find the characteristics that the most profitable movies have in common.

Q2: Which movies have the highest and lowest profit?

  • As we can see, the highest profit was made by Avatar, directed by James Cameron.
  • The lowest profit belongs to The Warrior's Way, directed by Sngmoo Lee.
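A lookup of this kind can be sketched as follows, assuming that profit is defined as revenue minus budget (the text does not spell out the definition) and that the columns are named as shown. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical movies; Movie C loses money.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Director A", "Director B", "Director C"],
    "budget": [100, 200, 400],
    "revenue": [900, 250, 50],
})

# Assumed definition: profit = revenue - budget.
movies["profit"] = movies["revenue"] - movies["budget"]

# idxmax / idxmin return the index labels of the extreme rows.
most_profitable = movies.loc[movies["profit"].idxmax()]
least_profitable = movies.loc[movies["profit"].idxmin()]

print(most_profitable["original_title"], most_profitable["profit"])
print(least_profitable["original_title"], least_profitable["profit"])
```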
Q3: Top 10 highest- and lowest-rated movies on TMDb (by average vote)

Above we see the top 10 and the worst 10 movies,
sorted exclusively by average vote.
Let's visualize them with a bubble chart.

Movies with the highest average vote: The Shawshank Redemption, Stop Making Sense, The Godfather.
Movies with the lowest average vote: Foodfight, Dracula 3D, FearDotCom.
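A ranking like this can be produced with `nlargest` and `nsmallest`. The sketch assumes an average-vote column named `vote_average` is available for this step, and uses made-up titles and scores with `n = 2` instead of 10.

```python
import pandas as pd

# Made-up titles and scores.
movies = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "vote_average": [8.4, 2.2, 7.9, 3.1, 5.5],
})

n = 2  # the analysis uses 10; 2 keeps the toy output short

# Highest-rated first, and lowest-rated first, respectively.
top_rated = movies.nlargest(n, "vote_average")
worst_rated = movies.nsmallest(n, "vote_average")

print(top_rated["original_title"].tolist())
print(worst_rated["original_title"].tolist())
```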

Q4: How many movies were made yearly?

Great! Now we have a list showing how many movies were made each year,
and we can easily find the years with the most and the fewest movies.

199 movies were released in 2011
and just 4 in 1969.
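Counting releases per year can be sketched with `value_counts`, assuming a `release_year` column. The years below are toy data and do not reproduce the counts quoted above.

```python
import pandas as pd

# Toy release years.
movies = pd.DataFrame({
    "release_year": [2011, 2011, 2011, 1969, 2000, 2000],
})

# Number of movies per year, ordered chronologically.
movies_per_year = movies["release_year"].value_counts().sort_index()

# Years with the most and the fewest releases.
busiest_year = movies_per_year.idxmax()
quietest_year = movies_per_year.idxmin()
print(busiest_year, quietest_year)
```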

Q5: Which genres are the most popular?

The genre column consists of a string of genres separated by pipes ("|").

To answer the question, we need to divide the movies into groups based on individual genres. Otherwise, we would have to analyze 1792 combinations of genres.

Instead, we will create a special function for splitting the genres.
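A splitting function along these lines is sketched below. The function name `count_by_genre` and the column name `genres` are hypothetical, and the approach shown (`str.split` followed by `explode`) is one possible implementation, not necessarily the one used in the original notebook.

```python
import pandas as pd

def count_by_genre(frame, column="genres"):
    """Count how often each single genre occurs in a pipe-separated column."""
    # "Drama|Comedy" -> ["Drama", "Comedy"], then one row per genre.
    single_genres = frame[column].str.split("|").explode()
    return single_genres.value_counts()

# Toy data with pipe-separated genre strings.
movies = pd.DataFrame({
    "genres": ["Drama|Comedy", "Drama|Thriller", "Comedy", "Drama"],
})

genre_counts = count_by_genre(movies)

# Share of each genre among all genre labels, in percent.
genre_share = (genre_counts / genre_counts.sum() * 100).round(1)
print(genre_counts)
print(genre_share)
```

Because a movie with several genres contributes one label per genre, the counts add up to more than the number of movies, and the percentages are shares of genre labels rather than of movies.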

Drama has the most releases (about 3,900 movies).
Second and third place are taken by Comedy and Thriller.

We have identified which genres are the most and least popular.
The absolute leader is Drama (17.8%), followed by Comedy (13.8%) and Thriller (11.1%).

Below is another type of visualization, a pie chart,
which gives an idea of the relationship between release year and genre for the top 50 movies.

In the pie chart we can see a multiple-variable exploration of release year and movie genre.
The largest numbers of top 50 movies were produced in 2014, 1999, 1994 and 2015.
What actually surprises me is that so many top-voted movies were released before 2000.
The majority of the top movies belong to the Drama genre.
It is important to note that this exploration covers only the top 50 movies and not the complete dataset, so the chart gives an incomplete picture of which genres are filmed most often.

Q7: What is the average runtime of movies?

From this plot we can see that:

Conclusions

This dataset is rich in information. Here are my conclusions:

Citations

The online documentation of pandas, NumPy, and Matplotlib.

In case of errors, I consulted https://stackoverflow.com/

Limitations